<P> <img src="https://i.ibb.co/gyNf19D/nhslogo.png" alt="nhslogo" border="0" width="100" align="right"><font size="6"><b> CS4132 Data Analytics</b> </font>
<div class="alert alert-block alert-danger">Important Note: Please keep your report concise and relevant (i.e. show only relevant steps and visualizations used to answer your research questions).</div>

# Table of Contents (with relevant hyperlinks to sections)

1. [Motivation & Background](#motivbg)
2. [Research Questions and Summary of Answers](#qns)
3. [Datasets](#dataset)
4. [Methodology](#method)
5. [Data Acquisition](#acquire)
6. [Data Cleaning](#cleaning)
7. [EDA](#eda)
8. [Results Findings & Conclusion](#results)
9. [Recommendations or Further Works](#recommendations)
10. [References](#ref)

<div class="alert alert-block alert-warning">Give an overview of the project, motivation, background and goals.</div>

<p>American crosswords are word puzzles in which the goal is to fill every white square with letters so that the entries fit the given clues. Two rules distinguish American crosswords: every white square must be part of both an Across and a Down answer, and every answer must be at least 3 letters long. The answers are a mix of trivia and common phrases. There are two main criticisms of American crosswords.</p>

<p>The first criticism concerns accessibility. Sometimes a constructor must fill the grid with obscure terms because no better configuration can be found. These obscure terms are called "crosswordese"; they appear frequently in puzzles because they have convenient letter patterns. However, as computers have evolved to assist humans in construction, the quality of puzzles has been improving.</p>

<p>The second criticism concerns representation. Crosswords have been said to reflect a piece of the constructor's personality. Originally, crosswords were written for straight, liberally educated white men. Over time, people realised that other groups were under-represented in both the answers and the clue-writing. Hence, there has been a push to include more of these groups in the crossword, including mentorship for women, people of colour and LGBTQ people who want to construct crosswords.</p>

<p>This project aims to find out how accessible the crossword currently is, given the rise of computers as a construction aid. Similarly, it aims to find out how representative the crossword is of minorities.</p>
# Summary of Research Questions & Results<a id='qns'/>

<div class="alert alert-block alert-warning">Repeat your research questions in a numbered list. After each research question, clearly state the answer/conclusion you determined. Do not give details or justifications yet; just the answer.</div>

Accessibility:

<p>1. <b>"Crosswordese"</b> How has the number of obscure answers changed over the years? Crosswordese is the use of an obscure word with a convenient letter pattern, often rich in common letters or vowels, to fill in the grid. These words make puzzles hard for solvers who are not part of the "in-group" that knows these common crossword words. Hence, I would like to find out how much crosswordese the crossword has contained over the years.</p>

- Crosswordese has decreased only slightly over the years; the change is not very significant.
- Crosswordese increases slightly through the week, making later puzzles less accessible to newcomers.

<p>2. <b>Freshness</b> How has the "freshness" of crosswords changed over the years? Crosswords reflect the world and its current trends. As more and more crosswords join the pool, the number of never-before-seen terms and names in the crossword has steadily decreased. However, words and phrases are coined every day, and some catch on in modern language. This question investigates that tension, together with how computers have given constructors more freedom in filling the grid.</p>

- Freshness has increased over the years, giving better puzzles.
- Freshness increases through the week, as puzzles become looser in theme.

Representation:

<p>3. <b>Inclusive Clues</b> How has clue-writing changed over the years? Clue-writing is half of the puzzle, and it can introduce unwanted stereotypes. For example, clues for the answer MIT have been associated with men more often than with women, which reflects a bias that men are more prominent in tech. Hence, by counting the mentions of minorities in clues, one can gauge how progressive the puzzle has become.</p>

- Female names are used less often than male names in clues and answers combined.
- However, the use of names overall has also decreased over the years.
- Some outlets strive to use female names as often as male names, and a shift towards equality has been seen in recent years.

<p>4. <b>Constructors</b> How has the make-up of constructors changed over the years, and how has this affected the quality of crosswords? As noted above, a crossword reflects its constructor's experiences and views, so a more diverse pool of constructors brings more variety. There was a time when constructors were mostly men, which skewed the quality for some solvers. With prolific constructors offering more mentorships to minorities, however, there has been an uptick in minority constructors. This question analyses the trend as a whole and, where possible, the impact of mentorship.</p>

- Women and new constructors are seen to construct mostly early-week puzzles.
- Female representation in construction has a positive impact on inclusivity.
- Collaborations boost the quality of crosswords for all.
```python
import aiohttp
import asyncio
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup as bs
from matplotlib import pyplot as plt
import seaborn as sns
import plotly.express as px
```

<div class="alert alert-block alert-warning">Numbered list of dataset (with downloadable links) and a brief but clear description of each dataset used. Draw reference to the numbering when describing methodology (data cleaning and analysis).</div>

1. https://www.crosswordgiant.com/browse (website to scrape for the clue-answer pairs)
2. https://www.xwordinfo.com/ (contains more data about the NYT Crossword in particular)
3. https://books.google.com/ngrams/ (for word searching)
4. https://peterbroda.me/crosswords/wordlist/lists/peter-broda-wordlist__scored.txt (crossword construction wordlist by Peter Broda)
5. https://drive.google.com/uc?export=download&id=1Ruxn8XzRNstU6sDPOMm_K72fVookrPPr (crossword construction wordlist by Brooke Husic and Enrique Henestroza Anguiano)
6. https://www.verywellfamily.com/top-1000-baby-boy-names-2757618 (boy name list)
7. https://www.verywellfamily.com/top-1000-baby-girl-names-2757832 (girl name list)

<div class="alert alert-block alert-warning">You should demonstrate the data science life cycle here (from data acquisition to cleaning to EDA and analysis etc).</div>

<div class="alert alert-block alert-info">Display the data which will be used in the project. The data should be saved in .xlsx or .csv format to be submitted with the project. If webscraping has been done to obtain your data, save your webscraping code in another jupyter notebook as appendix to be submitted separately from the report. Import and display each dataset in a dataframe. For each dataset, give a brief overview of the data it contains, and explain the meaning of columns that are relevant to the project.</div>

Many of these datasets were scraped from the internet. The scraping code can be found in the Appendix.
### CrosswordGiant.com

This data was scraped from CrosswordGiant.com by visiting every page, then filtering and sorting by the relevant publication outlets. The 5 DataFrames below share the same structure and differ only in the value of the Outlet column. Each DataFrame has 4 columns.

<p>Clue - What hint was given for the answer in this crossword?</p>
<p>Answers - What is the expected answer to the given clue?</p>
<p>Outlet - Where was the crossword published?</p>
<p>Date - When was the crossword published?</p>
```python
nyt = pd.read_csv("New York Times.csv")
lat = pd.read_csv("L.A. Times Daily.csv")
uni = pd.read_csv("Universal.csv")
usat = pd.read_csv("USA Today.csv")
wsj = pd.read_csv("Wall Street Journal.csv")
nyt.head()
```

### XWordInfo.com

This dataset was obtained from XWordInfo, a site with extensive information on New York Times crosswords. It is named NYTCI, short for New York Times Constructor Info. The dataset has 8 columns. The first two are the Day Of Week and the Date. Some crosswords are collaborations between up to 3 people, hence the columns C1, C2 and C3, which stand for Constructor 1, 2 and 3 respectively. Where a crossword has fewer than 3 constructors, the unused columns contain a dash. C1, C2 and C3 No. give the position of the puzzle in the sequence of puzzles that constructor has published so far. C1, C2 and C3 Gender give the genders of the constructors.
```python
xwi = pd.read_csv("NYTCI.csv")
xwi.head()
```

### Google NGram

We took around 15K of the most common answers across all crosswords and passed each one to Google NGram through its URL, which returns scrapable data. We then kept the years relevant to our crosswords, 1990-2020, and inserted them into a DataFrame.

However, data is sometimes missing; for example, I found no data for "ISNT". Since this is uncommon and shows no discernible pattern, it is reasonable to treat the gaps as random and ignore them.

There are 3 different files because the answers were scraped in different sessions. The resulting DataFrame is displayed.
```python
# The answers were scraped in three separate sessions, one file per session
ngram = pd.read_csv("CommonAnswers.csv")
ngram2 = pd.read_csv("CommonAnswers2.csv")
ngram3 = pd.read_csv("CommonAnswers3.csv")
ngram.set_index("Year", inplace=True)
ngram2.set_index("Year", inplace=True)
ngram3.set_index("Year", inplace=True)
# Join the three tables column-wise on the shared Year index
ngram = pd.merge(left=ngram, right=ngram2, left_index=True, right_index=True)
ngram = pd.merge(left=ngram, right=ngram3, left_index=True, right_index=True)
ngram.head()
```

### Crossword Wordlists

<p>In crossword construction, wordlists are fed into a program that suggests the best fill for a particular section, or even for the whole grid. Wordlists are therefore made as comprehensive as possible, to maximise the number of configurations from which a human can pick the best one. Wordlist authors usually also score each entry according to how good an answer it is, in their opinion. Using these wordlists, we can also check how "good" each crossword is, because they give a quantifiable weight to each answer.</p>

<p>Of course, each wordlist is biased towards the tastes of whoever made it. Hence, we use 2 independent wordlists to cross-check. There are many wordlists available; these two were chosen because they are very comprehensive and also free.</p>
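As a minimal sketch of the cross-check described above, the snippet below parses two tiny, made-up wordlists in the `ANSWER;SCORE` layout and places their scores side by side. The entries, the scores and the helper name `parse_wordlist` are hypothetical and only illustrate the idea; the real lists are far larger.

```python
import io
import pandas as pd

# Made-up sample data in the ANSWER;SCORE layout (not taken from the real lists)
SAMPLE_A = "AREA;60\nESNE;20\nOREO;55\n"
SAMPLE_B = "AREA;50\nESNE;10\nZZZ;5\n"

def parse_wordlist(text):
    """Return a Series of scores indexed by answer (hypothetical helper)."""
    frame = pd.read_csv(io.StringIO(text), delimiter=";", header=None,
                        names=["Answer", "Score"])
    return frame.set_index("Answer")["Score"]

scores_a = parse_wordlist(SAMPLE_A)
scores_b = parse_wordlist(SAMPLE_B)

# Keep only the answers present in both lists, with one column per list
both = pd.concat({"A": scores_a, "B": scores_b}, axis=1, join="inner")
# A large gap between the two columns flags an answer the authors disagree on
both["Gap"] = (both["A"] - both["B"]).abs()
print(both)
```

Answers that appear in only one list drop out of the inner join, which is one reason a single list on its own can give a skewed picture.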
```python
# Each line has the form ANSWER;SCORE and the file has no header row
listA = pd.read_csv("peter-broda-wordlist__scored.txt", delimiter=";", header=None)
listA.columns = ["Answer", "Score"]
listA.set_index("Answer", inplace=True)
listA.head()
```

```python
listB = pd.read_csv("spreadthewordlist_caps.txt", delimiter=";", header=None)
listB.columns = ["Answer", "Score"]
listB.set_index("Answer", inplace=True)
listB.head()
```

### Girls'/Boys' Name Database

This dataset was acquired by copying and pasting the lists from the website. It will be used in Q3 to look for occurrences of these names. It is used only as a lookup, not as a DataFrame.
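The lookup role of the name lists can be illustrated with a short sketch. The names, the clue and the helper `count_names` below are hypothetical; the sketch only shows why a set is a convenient structure for this kind of membership test.

```python
import re

# Hypothetical stand-ins for the real name lists
girl_names = {"Ada", "Emma", "Olivia"}
boy_names = {"Liam", "Noah", "Omar"}

def count_names(clue, names):
    """Count the words in a clue that appear in a set of first names."""
    words = re.findall(r"[A-Za-z]+", clue)
    return sum(word in names for word in words)

clue = "Emma Stone's co-star Noah"
girls_found = count_names(clue, girl_names)
boys_found = count_names(clue, boy_names)
print(girls_found, boys_found)
```

Membership tests on a set take roughly constant time, so the check stays cheap even when it is repeated for every word of every clue.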
```python
# Load each name list into a set for fast membership checks
gNames = pd.read_csv("girl_names.txt", header=None)
gNames = set(gNames[0])
bNames = pd.read_csv("boy_names.txt", header=None)
bNames = set(bNames[0])
```

<div class="alert alert-block alert-info">For data cleaning, be clear in which dataset (or variables) are used, what has been done for missing data, how was merging performed, explanation of data transformation (if any). If data is calculated or summarized from the raw dataset, explain the rationale and steps clearly.</div>

Since most of the data was scraped, I was able to control how clean it was, so its quality is high. There were some hitches during data collection, but missing data is rare and may be absent altogether.

For two of the datasets, CrosswordGiant and NGram, a requested page or entry might not exist. This was handled by catching the exception raised when I tried to process the empty data, which ensures that no invalid data is written to the saved file. Again, the code is found in the Appendix.

For the name lists, no cleaning is required; that has already been done by the publisher. For checking symmetry, the dataset is very simple and was acquired by scraping. Some of its entries are incorrect because of a logic error that I did not have the skills to fix, but the errors occur at random and should not affect the results significantly, so no cleaning is required.
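The exception-handling approach described above might look like the following sketch. The function `parse_entry` and the sample pages are hypothetical stand-ins for the real scraping code in the Appendix; the point is only that a page which fails to parse is skipped rather than written to the saved file.

```python
def parse_entry(raw):
    """Split a raw 'clue|answer' string; raise ValueError if it is empty or malformed."""
    clue, answer = raw.split("|")  # raises ValueError when the separator is missing
    if not clue or not answer:
        raise ValueError("empty field")
    return {"Clue": clue, "Answers": answer}

# Made-up page contents: two valid entries, one empty page, one page with empty fields
raw_pages = ["Sandwich cookie|OREO", "", "Region|AREA", "|"]

rows = []
skipped = 0
for raw in raw_pages:
    try:
        rows.append(parse_entry(raw))
    except ValueError:
        skipped += 1  # invalid data never reaches the output

print(len(rows), "kept;", skipped, "skipped")
```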
### CrosswordGiant

<p>In this dataset, some webpages contain garbage data: the answers are either just XXXXXXXX or are duplicated many times. The former is harder to detect and is cleaned later, when answering question 1. The latter is easily removed by counting how many entries the crossword of a particular day has and discarding the puzzles with too many.</p>

<p>Sometimes, publications fill their crosswords with puns, which are bogus words that make no sense without the puzzle's theme. These gimmicks are hard to detect, and unfortunately CrosswordGiant does not flag such cases. The problem is a linguistic one and is outside the scope of this project. Some background helps here: bogus words follow a theme, and themed crosswords appear only on certain days of the week. By and large, it is reasonable to assume that bogus words appear at random and are independent between crosswords.</p>

Hence, these are the steps for cleaning this dataset.

1. Given the date, find the day of the week.
2. Toss out known puzzles with bogus words.
3. Find the puzzles with too many clues and discard them.

Then, we merge the outlets into one DataFrame.

While doing the project, I found that this dataset is missing some New York Times crosswords from around 2000. The gap is not crucial to the project and hence can be ignored.
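Steps 1 and 3 above can also be sketched in vectorised form on toy data. This is an illustration under stated assumptions, not the code used for the project: the rows are made up, the date format `%b %d %Y` is inferred from how the notebook splits the date string, and the threshold `MAX_ANSWERS` is set to 3 only so that the toy example triggers it (the project uses 150).

```python
import pandas as pd

# Made-up rows: one puzzle with 2 answers and one with 4
toy = pd.DataFrame({
    "Outlet": ["NYT"] * 6,
    "Date": ["Jan 3 2000"] * 2 + ["Jan 4 2000"] * 4,
    "Answers": ["AREA", "OREO", "ERA", "ONE", "ALE", "EEL"],
})

# Step 1: parse the date and derive the day of the week (Monday=0)
toy["Date"] = pd.to_datetime(toy["Date"], format="%b %d %Y")
toy["Day"] = toy["Date"].dt.dayofweek

# Step 3: count the answers per puzzle and drop puzzles with too many
MAX_ANSWERS = 3  # toy threshold, chosen for this example only
per_puzzle = toy.groupby(["Outlet", "Date"])["Answers"].transform("count")
clean = toy[per_puzzle <= MAX_ANSWERS]
print(clean)
```

Parsing the whole column at once avoids a row-by-row `apply`, which tends to be slow on a dataset of this size.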
```python
# Create a new column; set it to -1 as the day of the week is not known yet
nyt["Day"] = -1
lat["Day"] = -1
usat["Day"] = -1
uni["Day"] = -1
wsj["Day"] = -1

def dayOfDate(row):
    """Parse a '<month abbreviation> <day> <year>' string and record the day of the week."""
    months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
    m, d, y = row.Date.split(" ")
    ts = pd.Timestamp(year=int(y), month=months.index(m) + 1, day=int(d))
    row.Day = ts.dayofweek  # Monday=0
    row.Date = ts
    return row
```

```python
# This code takes long to run due to the huge dataset,
# which is why the result is also saved to a new file
nyt = nyt.apply(dayOfDate, axis=1)
lat = lat.apply(dayOfDate, axis=1)
wsj = wsj.apply(dayOfDate, axis=1)
usat = usat.apply(dayOfDate, axis=1)
uni = uni.apply(dayOfDate, axis=1)
```

```python
crosswords = pd.concat([nyt, lat, uni, usat, wsj])
crosswords.drop_duplicates(inplace=True)
crosswords.to_csv("Clean XWs.csv")
```

```python
crosswords = pd.read_csv("Clean XWs.csv", index_col=0)
crosswords.drop_duplicates(inplace=True)
# Count the answers per puzzle (one outlet on one date)
badEntries = crosswords.groupby(["Outlet", "Date"])[["Answers"]].count()
badEntries.reset_index(inplace=True)
# Crosswords with more than 150 answers are incorrect entries
badEntries = badEntries[badEntries.Answers > 150]
badEntries.head()
```

```python
crosswords["Date"] = pd.to_datetime(crosswords["Date"])
crosswords["Year"] = crosswords["Date"].dt.year
```

```python
# Remove the bad crossword entries found above
for i, row in badEntries.iterrows():
    crosswords = crosswords[(crosswords.Outlet != row.Outlet)
                            | (crosswords.Date != pd.Timestamp(row.Date))]
```

### XWordInfo

Curiously, the site's formatting is sometimes irregular. Its numbering system calls a constructor's first puzzle "the debut puzzle" and the others "puzzle # n", which is an easy fix. It also refers to people as Mr or Ms, depending on their gender, which was easy to replace. The harder part was the inconsistencies in the data: for some people, their name was used instead of Mr X or Ms X. Since these cases are very few, I cleaned them by hand.

One person's gender appears as "A". This arose from how the scraping was done. After visiting the website, I found the person's name and determined that it is a male name.
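As a small illustration of the replacements described above, the sketch below normalises a made-up constructor table. The column names `C1 No.` and `C1 Gender` follow the dataset description, but the rows are invented.

```python
import pandas as pd

# Invented rows in the style of the scraped table
toy_xwi = pd.DataFrame({
    "C1 No.": ["the debut puzzle", "puzzle # 12", "puzzle # 3"],
    "C1 Gender": ["Mr", "Ms", "Ms"],
})

# "the debut puzzle" is puzzle 1; "puzzle # n" becomes n
numbers = (toy_xwi["C1 No."]
           .replace("the debut puzzle", "1")
           .str.replace("puzzle # ", "", regex=False)
           .astype(int))
# Map the honorifics to gender codes
genders = toy_xwi["C1 Gender"].map({"Mr": "M", "Ms": "F"})
print(numbers.tolist(), genders.tolist())
```

Using `map` with an explicit dictionary leaves any unexpected value as `NaN`, which makes irregular entries, such as a name in place of an honorific, easy to spot before fixing them by hand.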
```python
# Convert the puzzle numbers and the genders
xwi.replace("puzzle # ", "", regex=True, inplace=True)
xwi.replace("the debut puzzle", "1", inplace=True)
xwi.replace("Mr", "M", inplace=True)
xwi.replace("Ms", "F", inplace=True)
# Entries where a name appeared instead of Mr/Ms, cleaned by hand
xwi.replace("Jakob Weisblat", "M", inplace=True)
xwi.replace("Pao Roy", "M", inplace=True)
xwi.replace("Emet Ozar", "F", inplace=True)
xwi.replace("A", "M", inplace=True)  # A. Tariq
```

```python
# Keep only the valid entries: the scraping method was to overshoot the date,
# so some rows have not been updated yet
xwi = xwi[~xwi["Date"].str.contains("is")].copy()
xwi["Date"] = pd.to_datetime(xwi["Date"])
```

### Google NGram

Google NGram returns different results depending on the capitalisation of the word, so I queried both the all-caps and the all-lowercase form of each word. This yields a DataFrame with two columns for the same word. This is easy to clean: we just add the two columns together. The function used also returns the columns in sorted order.

However, DataFrames make a poor lookup table, so I have chosen to convert the result into a dictionary.
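The merge of the two capitalisations can be sketched on toy data as follows. The frequencies are made up, and grouping on the transposed frame is one possible way to sum same-named columns, chosen here because it does not rely on `groupby(axis=1)`, which newer pandas versions deprecate.

```python
import pandas as pd

# Made-up frequencies: "AREA" appears under both capitalisations
toy_ngram = pd.DataFrame(
    {"AREA": [1.0, 2.0], "area": [10.0, 20.0], "OREO": [0.5, 0.25]},
    index=pd.Index([1999, 2000], name="Year"),
)

# Give both capitalisations the same upper-case name
toy_ngram.columns = [name.upper() for name in toy_ngram.columns]
# Sum the columns that share a name; the result comes back sorted by word
merged = toy_ngram.T.groupby(level=0).sum().T

# A plain dictionary serves as the lookup table: word -> {year: frequency}
lookup = merged.to_dict()
print(lookup["AREA"][2000])
```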
```python
# Upper-case the column names so both capitalisations of a word share one name,
# then sum the columns that share a name (grouping on the transposed frame
# avoids groupby(axis=1), which newer pandas versions deprecate)
ngram.columns = [x.upper() for x in ngram.columns]
ngram = ngram.T.groupby(level=0).sum().T
ngram.to_csv("Clean NGram.csv")
ngram.head()
```

```python
ngramDict = ngram.to_dict()
ngramDict["AAA"][2000]
```

Although data cleaning is sparse in this project, this is offset by the large amount of data transformation in the EDA. This is because the data was scraped and because concrete research into this niche area is rather lacking.

<div class="alert alert-block alert-info">For each research questions shortlisted, outline your methodology in answering them. Discuss interesting observations or results discovered. Please note to only show EDA that's relevant to answering the question at hand. If you have done any data modeling, include in this section.</div>
### Q1. "Crosswordese"

Firstly, let us define "short answers" as answers with at most 7 letters, and "long answers" as answers with at least 8 letters.

For each crossword, we do the following:

1. Group all the clues from the specific day.
2. Filter the DataFrame to keep only short answers.
3. For each puzzle, look up each short answer that appears in the NGram table for the year the puzzle appeared, and add the values to a total; call this number the "Score".
4. For each puzzle, look up each short answer in each wordlist, take the corresponding score, and add these up for that outlet's daily crossword.

We obtain a DataFrame with the following columns: Day, Date, Outlet, Score, AScore, BScore. This is our base data for graphing. This data is saved in "Q1 Data.csv".

The scores indicate how much crosswordese a puzzle contains in general: the higher the score, the less crosswordese and the better the puzzle. Then, we plot to observe any trends.

There is a limitation to this method. Some words are not found in the lists, so I am unable to score them properly; I chose to give them a baseline score of 0 for the wordlists and a near-zero 1e-3 for NGram. This problem appears more often with the NGram dataset, as it has fewer entries and is less extensive. However, the method still gives a reasonably good picture, as all crosswords should be affected similarly by missing entries.
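The steps above can be sketched on toy data with a dictionary lookup and a group-wise mean. The answers and scores are made up, and `toy_scores` stands in for one wordlist; the sketch only illustrates the scoring logic, not the project's actual numbers.

```python
import pandas as pd

# Made-up wordlist scores, for illustration only
toy_scores = {"AREA": 60, "OREO": 50, "ESNE": 10}
puzzles = pd.DataFrame({
    "Outlet": ["NYT", "NYT", "NYT", "LAT", "LAT"],
    "Date": ["2000-01-03"] * 5,
    "Answers": ["AREA", "ESNE", "CROSSWORD", "OREO", "QQQ"],
})

# Keep short answers only (at most 7 letters)
short_toy = puzzles[puzzles["Answers"].str.len() <= 7].copy()
# Words missing from the list get the baseline score of 0
short_toy["AScore"] = short_toy["Answers"].map(toy_scores).fillna(0)
# Mean score per short answer for each puzzle
mean_score = short_toy.groupby(["Outlet", "Date"])["AScore"].mean()
print(mean_score)
```

Dividing by the number of short answers, rather than comparing raw sums, keeps puzzles of different sizes comparable.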
```python
def scoreNGram(row):
    year = row.Date.year
    if year > 2019:
        year = 2019  # no ngram data is available beyond this year
    try:
        # dictionary lookup, as DataFrames are slow lookup tables
        row.Score = ngramDict[row.Answers][year] * 10000
    except KeyError:  # this word is not within ngram
        row.Score = 1e-3
    return row
```

```python
# Look up the scores using the wordlists
dictA = listA.to_dict()
dictB = listB.to_dict()

def scoreWordlistA(row):
    try:
        row.AScore = dictA["Score"][row.Answers]
    except KeyError:  # this word is not within wordlist A
        row.AScore = 0
    return row

def scoreWordlistB(row):
    try:
        row.BScore = dictB["Score"][row.Answers]
    except KeyError:  # this word is not within wordlist B
        row.BScore = 0
    return row
```

```python
# This block transforms the data and forms the underlying dataset for this question
short = crosswords[crosswords.Answers.str.len() <= 7].copy()
short["AScore"] = 0
short["BScore"] = 0
short["Score"] = 0
# Score each answer against the wordlists and NGram
short = short.apply(scoreWordlistA, axis=1)
short = short.apply(scoreWordlistB, axis=1)
short = short.apply(scoreNGram, axis=1)
# Sum the scores per puzzle and count the answers
short = short.groupby(["Outlet", "Date", "Day"])[["Score", "AScore", "BScore"]].agg(["sum", "count"])
short.columns = ["".join(col) for col in short.columns]
short.drop(columns=["Scorecount", "AScorecount"], inplace=True)
short.rename(columns={"Scoresum": "Score", "AScoresum": "AScore",
                      "BScoresum": "BScore", "BScorecount": "Count"}, inplace=True)
short = short.reset_index()
# Find the mean score per answer
short["AAScore"] = short["AScore"] / short["Count"]
short["BBScore"] = short["BScore"] / short["Count"]
short["NNScore"] = short["Score"] / short["Count"]
# Extract information about the date
short.Date = pd.to_datetime(short.Date)
short["Year"] = short["Date"].dt.year
short["Month"] = short["Date"].dt.month
short.to_csv("ShortScore.csv")  # save the data
```

```python
sns.stripplot(x="Day", y="AAScore", data=short)
plt.title("Score using wordlist A against the day of week")
plt.ylim((40, 80))
plt.ylabel("Score")
pass
```

In the graph above, there does not seem to be much correlation between the day of the week and the amount of crosswordese present when we test with wordlist A.
sns.stripplot(x="Day",y="NNScore",data=short)
plt.title("Short score using NGram against the day of week")
plt.ylabel("Short score")

plt.figure(figsize=(12,5))
sns.scatterplot(x="Year",y="NNScore",data=short)
plt.title("Short score using NGram against the year")
plt.ylabel("Short score")

When using the NGram dataset, which reflects real-life usage of these words, an interesting trend emerges: there seem to be three distinct bands in the stripplot. A more detailed discussion is included in the results.
plt.figure(figsize=(12,5))
plt.ylim((58,70))
sns.lineplot(x="Year",y="AAScore",data=short,hue="Outlet")
plt.ylabel("Short score using wordlist")
plt.title("Short scores of crosswords using wordlist over the years")
pass

sns.displot(data=short,x="Date",y="AAScore",aspect=0.8,height=5)
plt.ylim((40,80))
plt.ylabel("Short score")
plt.title("Short score in crosswords by wordlists throughout the years")
plt.xlabel("Year")
pass

sns.displot(data=short,x="Date",y="NNScore",aspect=0.8,height=5)
plt.ylim((0,8))
plt.ylabel("NGram short score")
plt.title("Amount of crosswordese in crosswords throughout the years by NGram")
plt.xlabel("Year")

plt.figure(figsize=(12,5))
sns.lineplot(data=short,x="Year",y="AAScore")
plt.ylim((50,80))
plt.ylabel("Short score")
plt.title("Short score in crosswords throughout the years")
pass

plt.figure(figsize=(12,5))
sns.lineplot(data=short,x="Year",y="NNScore",hue="Outlet",ci=None)
plt.ylabel("NGram short score")
plt.title("NGram short score in various outlet's crosswords throughout the years")
pass

We can see a slight upward trend, showing improved accessibility by the wordlist metric. However, no such trend can be observed with the NGram data. The dip in 2000 can be explained by missing data.
# Q2 Freshness

The procedure to answer this question is similar to Q1: the wordlists are used for comparison, and their scores are used as given. For my own freshness metric for long answers, I score on a harmonic scale, with the i-th occurrence of an answer scoring 1/i. The score of a crossword is the sum of these values over its long answers. Since long answers are rather rare in a crossword, the number of long answers per puzzle is also recorded.

Afterwards, these scores are plotted to see if any trends can be spotted.
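The harmonic scoring described above can be sketched on toy data as follows. This is only an illustration under assumptions: the `toy` frame and its answers are made up, and `groupby(...).cumcount()` is swapped in as a vectorised alternative to counting occurrences row by row; it relies on the rows already being in publication order.

```python
import pandas as pd

# Hypothetical toy data: one row per long answer, in publication order.
toy = pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03"],
    "Answers": ["ICECREAMCONE", "ESCAPEROOM", "ICECREAMCONE", "ICECREAMCONE"],
})

# cumcount() numbers the repeats of each answer 0, 1, 2, ... in row order,
# so the i-th occurrence of an answer scores 1/i.
toy["Score"] = 1 / (toy.groupby("Answers").cumcount() + 1)

# Sum the per-answer scores to get one freshness score per puzzle.
puzzle_scores = toy.groupby("Date")["Score"].sum()
print(puzzle_scores)
```

A puzzle made entirely of first-time answers scores exactly its number of long answers, which is why the count is worth keeping alongside the sum.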
long=crosswords[crosswords.Answers.str.len()>=8].copy() #copy so the new columns are not written to a slice
long.reset_index(inplace=True,drop=True)
long["Score"]=0
long["AScore"]=0
long["BScore"]=0
long

seenCount={} #number of times each answer has appeared so far (renamed from dict to avoid shadowing the builtin)
def longScore(row):
    ans=row.Answers
    try:
        row.Score+=1/(1+seenCount[ans]) #i-th occurrence scores 1/i
        seenCount[ans]+=1
    except KeyError: #first occurrence
        seenCount[ans]=1
        row.Score+=1
    return row
long=long.apply(longScore,axis=1)
long=long.apply(scoreWordlistA,axis=1)
long=long.apply(scoreWordlistB,axis=1)

long=long.groupby(["Outlet","Date","Day"])[["Score","AScore","BScore"]].agg([sum,"count"])
long=long.reset_index()

long.columns = [''.join(col) for col in long.columns]
long.drop(columns=["Scorecount","AScorecount"],inplace=True)
long.rename(columns={"Scoresum":"Score","AScoresum":"AScore","BScoresum":"BScore","BScorecount":"Count"},inplace=True)

long

long.Date=pd.to_datetime(long.Date)

long["Year"]=long["Date"].dt.strftime("%Y")
long["Year"]=long.Year.astype(int)
long["Month"]=long["Date"].dt.strftime("%m")
long.Month=long.Month.astype(int)

long.to_csv("longScore.csv")

plt.figure(figsize=(12,5))
sns.stripplot( y="Score", x="Day", data=long, jitter=0.2, alpha=0.5)
plt.ylabel("Own Long Score")
plt.title("Own Long Score against day")
pass

plt.figure(figsize=(12,5))
sns.stripplot( y="AScore", x="Day", data=long, jitter=0.2, alpha=0.5)
plt.ylabel("Wordlist Long Score")
plt.title("Wordlist Long Score against day")

We can see that the long score increases through the week.
plt.figure(figsize=(12,5))
sns.displot(data=long,x="Date",y="AScore",aspect=0.8,height=7)
plt.ylim(top=1500)
plt.ylabel("Long Score")
plt.title("Displot of long score against time")

The maximum long score seems to be increasing over time.
plt.figure(figsize=(12,5))
sns.lineplot(x="Year",y="AScore",data=long,hue="Outlet")
plt.ylabel("Long Score")
plt.title("Lineplot of long score of various outlet's crosswords over time")

In general, the long score has been increasing, except for the Wall Street Journal, where it has been decreasing.
plt.figure(figsize=(12,5))
sns.boxplot(x="Month",y="AScore",data=long,hue="Outlet")
plt.ylabel("Long Score")
plt.title("Boxplot of long score against the month")

There seems to be no correlation between month and the quality of crosswords. This is expected.
plt.figure(figsize=(12,5))
sns.boxenplot(y="AScore",data=long,x="Outlet")
plt.ylabel("Long Score")
plt.title("Boxenplot of long score against outlet")

plt.figure(figsize=(12,5))
sns.stripplot(data=long,y="Count",x="Outlet")
plt.title("Number of long answers for each outlet")

We can see that the New York Times scores best for long answers, followed by the LA Times and Wall Street Journal, then Universal and USA Today. Similarly, the New York Times has the most long answers, followed by the Wall Street Journal, then the LA Times, then Universal, then USA Today.
long["Year2"]=(long["Year"].astype(int))//5*5 #bin the years into 5-year groups

plt.figure(figsize=(12,5))
sns.boxenplot(x="Year2",y="AScore",data=long)
plt.ylabel("Long Score")
plt.title("Boxenplot of long score against year")

No significant trend over time can be seen in this boxenplot.
plt.figure(figsize=(12,5))
sns.regplot(data=long,x="AScore",y="BScore")
plt.title("The scores of the two wordlists are generally similar")
pass

We can see that the long scores from the two wordlists are similar and do not differ by much. It is thus reasonable to assume that changing the wordlist will not affect the results significantly, so the results should be reasonably robust to the choice of wordlist.
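The visual agreement between the two wordlists could also be quantified with a correlation coefficient. The sketch below uses made-up scores (not the project's data) and is only meant to show the calculation; the column names mirror `AScore` and `BScore`, and the Spearman value is obtained by correlating ranks, which is an assumption about how one might compute it without extra libraries.

```python
import pandas as pd

# Hypothetical per-puzzle long scores from two wordlists.
scores = pd.DataFrame({
    "AScore": [310, 450, 520, 600, 780],
    "BScore": [300, 470, 500, 640, 760],
})

# Pearson measures linear agreement between the two scores.
pearson = scores["AScore"].corr(scores["BScore"])

# Spearman is the Pearson correlation of the ranks, so it only checks
# whether the two wordlists order the puzzles the same way.
spearman = scores["AScore"].rank().corr(scores["BScore"].rank())

print(f"Pearson r = {pearson:.3f}, Spearman rho = {spearman:.3f}")
```

A value close to 1 for both would support the claim that swapping wordlists changes little; a high Spearman but lower Pearson would mean the rankings agree while the scales differ.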
### Q3. Inclusive Clues

In this section, I analyse how inclusive clues have been over the years. I first concatenate all the words in each crossword, then search for each word in the namelists. Each match is assigned one point, and the totals are plotted to look for trends. The two namelists used are from baby-name websites. This method is limited by the namelists, but with 1000 names for each gender it should be fairly robust.
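As a minimal sketch of the matching step described above, the snippet below counts namelist hits in a single puzzle's text. The tiny namelists, the example text and the helper `count_names` are all hypothetical and only illustrate the idea; the tokenisation (lowercase alphabetic runs) is an assumption and differs slightly from stripping punctuation and splitting on spaces.

```python
import re

# Hypothetical miniature namelists; the project uses 1000 names per gender.
girl_names = {"ella", "ava", "mia"}
boy_names = {"liam", "noah", "eli"}

def count_names(text, names):
    """Count how many words in `text` appear in the set `names` (case-insensitive)."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(word in names for word in words)

# Clues and answers of one puzzle, joined into a single string.
puzzle_text = "Singer Fitzgerald ELLA Biblical ark builder NOAH Priest in Samuel ELI"
girl_score = count_names(puzzle_text, girl_names)
boy_score = count_names(puzzle_text, boy_names)
print(girl_score, boy_score)
```

Using sets keeps each lookup constant-time, which matters when every word of every puzzle is checked. Note that a word such as "Samuel" is missed here simply because it is not in the toy list, which is the same limitation the full namelists have.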
sample=crosswords.copy()
sample["Words"]=sample["Clues"]+' '+sample["Answers"]

sample.Words=sample.Words.astype(str)
names=sample.groupby(['Outlet','Date'])['Words'].apply(' '.join).reset_index()
names["Words"]=names["Words"].str.replace("[\"_.“()”!:']+",regex=True,repl="") #strip punctuation
names["Words"]=names["Words"].str.split("[ ]+",regex=True).to_frame().reset_index(drop=True) #split into a list of words
names

gNames={val.lower() for val in gNames} #sets give fast membership checks
def gNameSearch(row):
    for word in row.Words:
        if word.lower() in gNames:
            row.GScore+=1
    return row

bNames={val.lower() for val in bNames}
def bNameSearch(row):
    for word in row.Words:
        if word.lower() in bNames:
            row.BScore+=1
    return row

names["BScore"]=0
names["GScore"]=0
names=names.apply(bNameSearch,axis=1)
names=names.apply(gNameSearch,axis=1)

names["Year"]=names["Date"].dt.strftime("%Y")
names["Year"]=names["Year"].astype(int)

plt.figure(figsize=(12,5))
names["Total"]=names["BScore"]+names["GScore"]
sns.lineplot(data=names[names.Outlet!="Wall Street Journal"],x="Year",y="Total",hue="Outlet",ci=None)
plt.ylabel("No. of Names")
plt.title("Occurrences of names in various outlet's crosswords over the years")

We can see that, in general, the number of names being used is decreasing. This may also be an effect of the push for accessibility, making puzzles more about words than obscure celebrities. The Wall Street Journal was removed from this comparison because its values were so high that they made the graph unreadable.
plt.figure(figsize=(12,5))
sns.lineplot(data=names,x="Year",y="BScore",color="b")
sns.lineplot(data=names,x="Year",y="GScore",color="pink")
plt.ylabel("Occurrences of Gendered Names")
plt.title("Occurrences of gendered names in crosswords over the years")
pass

plt.figure(figsize=(12,5))
sns.lineplot(data=names[names.Outlet=="Wall Street Journal"],x="Year",y="BScore",color="b")
sns.lineplot(data=names[names.Outlet=="Wall Street Journal"],x="Year",y="GScore",color="pink")
plt.ylabel("Occurrences of Gendered Names")
plt.title("Occurrences of gendered names in Wall Street Journal crosswords over the years")
pass

plt.figure(figsize=(12,5))
sns.lineplot(data=names[names.Outlet=="USA Today"],x="Year",y="BScore",color="b")
sns.lineplot(data=names[names.Outlet=="USA Today"],x="Year",y="GScore",color="pink")
plt.ylabel("Occurrences of Gendered Names")
plt.title("Occurrences of gendered names in USA Today crosswords over the years")
pass

The Wall Street Journal has also been steadily decreasing the number of names in its crosswords. In general, outlets have been including more male names than female names. This is not true for USA Today, however: surprisingly, female names now appear more often than male names there. This observation will be discussed in greater detail in the results section.
plt.figure(figsize=(12,5))
sns.boxplot(data=names,x="Outlet",y="Total")
plt.ylabel("Number of names")
plt.title("Number of names in crosswords per outlet")

Most outlets use few names in their crosswords, except for the Wall Street Journal, which uses them more often than the rest.
names

### Q4. Constructors

Now, we combine the genders of the constructors with the NYT crosswords and do some analysis. This brings together everything from the earlier questions: the results from Q1, Q2 and Q3 are used to assist the exploration. First, let us combine all the results into one dataframe.
short2=short.rename(columns={"AAScore":"AShort","BBScore":"BShort","NNScore":"NShort"})
short2.rename(columns={"AScore":"AShortSum","BScore":"BShortSum","Score":"NShortSum"},inplace=True)
long2=long.rename(columns={"AScore":"ALong","BScore":"BLong","Score":"NLong"})
names2=names.rename(columns={"BScore":"BNames","GScore":"GNames"})
short2.drop(columns=["Count","Day","Year","Month"],inplace=True)
long2.drop(columns=["Year"],inplace=True)
xwData=pd.merge(short2,long2,how="outer",on=["Outlet","Date"])
xwData=pd.merge(xwData,names2,how="outer",on=["Outlet","Date"])

xwData.to_csv("Q1Q2Q3.csv")

xwData=pd.read_csv("Q1Q2Q3.csv",index_col=0)

I also plot the Q1, Q2 and Q3 metrics against one another to assist with this question. Unfortunately, no visible trends can be spotted.
sns.pairplot(data=xwData[["AShort","ALong","BNames","GNames"]])

For this question, we only have data from the NYT, hence we need to slice the dataframe and merge it with the constructor info.
xwData["Date"]=pd.to_datetime(xwData["Date"])
xwData.drop(columns=["Day"],inplace=True)
nytData=pd.merge(xwi,xwData[xwData.Outlet=="New York Times"],on="Date",how="inner")
nytData.drop(columns=["Outlet"],inplace=True) #we already know it is from the New York Times
nytData=nytData[nytData.Month.notna()] #we have some na entries
nytData["Month"]=nytData["Month"].astype(int)
nytData["Year"]=nytData["Year"].astype(int)
nytData["C1 No."]=nytData["C1 No."].astype(int,errors="ignore")
nytData.reset_index(inplace=True,drop=True)
nytData.info()

plt.figure(figsize=(12,5))
sns.countplot(data=nytData,x="Year",hue="C1 Gender")
plt.title("Number of crosswords constructed each year for the NYT, split by gender")

We can see that the crossword scene in the New York Times is primarily male-dominated.
daysOfWeek=["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"]
plt.figure(figsize=(12,5))
sns.countplot(data=nytData,x="Day",hue="C1 Gender")
plt.title("Number of constructors for each day of the week, split by gender")

As the week goes on, more and more male constructors appear and, unfortunately, the number of female constructors decreases. The exception is Sunday, which has a similar difficulty to Wednesday and Thursday puzzles. One possible inference is that the tougher late-week difficulty discourages female constructors from submitting, although the plot alone cannot establish the cause.
sns.lineplot(data=nytData,x="Year",y="NShort",hue="C1 Gender")
plt.title("NGram short score against year, split by gender")
plt.ylabel("NGram short score")

sns.lineplot(data=nytData,x="Year",y="NLong",hue="C1 Gender")
plt.ylabel("Self-Long Score")
plt.title("Self-long score against year, split by gender")

sns.lineplot(data=nytData,x="Year",y="AShort",hue="C1 Gender")
plt.ylabel("Wordlist Short Score")
plt.title("Wordlist short score against year, split by gender")

sns.lineplot(data=nytData,x="Year",y="ALong",hue="C1 Gender")
plt.ylabel("Wordlist Long Score")
plt.title("Wordlist long score against year, split by gender")

We can see that the amount of crosswordese is generally similar between genders, but freshness is higher among men.
sns.boxplot(data=nytData,x="C1 Gender",y="AShort")
plt.ylim((40,65))
plt.ylabel("Wordlist Short Score")
plt.title("Boxplot of wordlist short score, split by gender")

Using the wordlists, there seems to be no difference in the amount of crosswordese between men and women.
sns.boxplot(data=nytData,x="C1 Gender",y="ALong")
plt.ylabel("Wordlist Long Score")
plt.title("Boxplot of wordlist long score, split by gender")

sns.boxplot(data=nytData,x="C1 Gender",y="NLong")
plt.ylabel("Own Metric Long Score")
plt.title("Boxplot of own metric long score, split by gender")

Using both the wordlist and my own metric, men have a higher long score than women.
sns.lineplot(data=nytData,y="BNames",x="Year",hue="C1 Gender")
plt.ylabel("No. of boy names")
plt.title("No. of boy names in NYT Crosswords over the years")
plt.ylim((3,8))
pass

sns.lineplot(data=nytData,y="GNames",x="Year",hue="C1 Gender")
plt.ylabel("No. of girl names")
plt.title("No. of girl names in NYT Crosswords over the years")
plt.ylim((3,8))
pass

Men and women use a similar number of boy names. However, women are generally more inclined to use girl names.
plt.figure(figsize=(12,5))
sns.scatterplot(data=nytData[nytData["C2 Gender"]=="-"],x="C1 No.",y="AShort")
plt.title("Short Score of constructors by experience")
plt.ylabel("Short Score")

plt.figure(figsize=(12,5))
sns.scatterplot(data=nytData[nytData["C2 Gender"]=="-"],x="C1 No.",y="ALong")
plt.title("Long Score of constructors by experience")
plt.ylabel("Long Score")
pass

From the two graphs, we can see that as a constructor makes more puzzles, the score of their "worst" puzzle increases, meaning that they become more consistent in making puzzles.
plt.figure(figsize=(12,5))
sns.lineplot(data=nytData[(nytData["C1 No."]<=10)],x="Year",y="AShort")
sns.lineplot(data=nytData[(nytData["C1 No."]>10)],x="Year",y="AShort",color="g")
plt.ylabel("Short Score")
plt.title("Newer constructors have quite similar short scores to seasoned ones")

plt.figure(figsize=(12,5))
sns.lineplot(data=nytData[(nytData["C1 No."]<=10)],x="Year",y="ALong")
sns.lineplot(data=nytData[(nytData["C1 No."]>10)],x="Year",y="ALong",color="g")
plt.ylabel("Long Score")
plt.title("Newer constructors have higher long scores than seasoned ones")

plt.figure(figsize=(12,5))
nytData["C1 Bin"]=nytData["C1 No."]//10*10 #bin constructors by tens of previous puzzles
sns.lineplot(data=nytData,x="C1 Bin",y="BNames")
sns.lineplot(data=nytData,x="C1 Bin",y="GNames",color="pink")
plt.title("Usage of names against constructor number")
plt.ylabel("Number of names")
plt.xlabel("Number of previous puzzles constructed")

As someone constructs more puzzles, the number of names they use decreases. This suggests that they want to make their puzzles more accessible. Additionally, the gap between the number of boy and girl names they use narrows, suggesting that they may be striving to be more inclusive.
plt.figure(figsize=(12,5))
sns.displot(data=nytData,x="Date",y="C1 No.")
plt.ylabel("Number of previous puzzles constructed")
plt.title("Constructor number against year")

This graph shows that constructor number increases with time, which is an expected trend. It also shows how constructors keep returning to the New York Times. However, a majority have only one puzzle in the New York Times, as we can see from the more darkly coloured region close to 0.
plt.figure(figsize=(16,5))
sns.displot(data=nytData,x="Day",y="C1 No.",row_order=daysOfWeek,height=8)
plt.ylabel("Number of previous puzzles constructed")
plt.title("Constructor number against the day of week they get published")

plt.figure(figsize=(16,5))
sns.displot(data=nytData[nytData["C1 No."]<10],x="Day",y="C1 No.",height=8,row_order=daysOfWeek)
plt.ylabel("Number of previous puzzles constructed")
plt.title("Constructor number against the day of week they get published")

Unfortunately, the days are not sorted properly in these plots, as the `row_order` argument did not sort them as intended. In general, we find that as the week goes from Monday to Saturday, the number of new constructors decreases. The anomaly here is the Sunday puzzle.
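One possible way around the ordering problem is to store the weekday order in the data itself. In `sns.displot`, `row_order` controls the order of facet rows rather than the categories on the x-axis, so it has no effect on a plot without a `row` variable. The sketch below uses a hypothetical toy frame in place of `nytData` and assumes `Day` holds weekday names; seaborn generally follows the category order of a categorical column, though that is worth confirming on the installed version.

```python
import pandas as pd

days_of_week = ["Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday"]

# Hypothetical toy frame standing in for nytData.
toy = pd.DataFrame({
    "Day": ["Sunday", "Monday", "Friday", "Tuesday"],
    "C1 No.": [12, 0, 3, 1],
})

# An ordered categorical carries the weekday order with the column,
# so sorting (and category-aware plotting) no longer falls back to
# alphabetical or first-seen order.
toy["Day"] = pd.Categorical(toy["Day"], categories=days_of_week, ordered=True)
toy = toy.sort_values("Day")
sorted_days = toy["Day"].tolist()
print(sorted_days)
```

The same conversion applied to the real `Day` column before plotting would be the natural next step to try.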
collab=nytData[nytData["C2 Gender"]!="-"].copy() #copy the slice before adding columns
collab["C2 No."]=collab["C2 No."].astype(int)
collab

sns.scatterplot(data=collab,x="C1 No.",y="C2 No.")
plt.xlabel("Puzzle number of the 1st constructor")
plt.ylabel("Puzzle number of the 2nd constructor")
plt.title("Scatterplot of the relationship of the puzzle number of collaborators")

There seems to be no correlation between the experience levels of collaborators. However, we can see a clear clustering of points along the x- and y-axes, where one collaborator has few or no previous puzzles. This suggests that collaborations are mostly used to induct new constructors into the New York Times crossword.
sns.histplot(data=collab,x="Year",color="b",bins=range(1997,2023))
plt.title("The number of collaborations has been increasing year on year")
plt.ylabel("Number of collaborations")

While the number of daily crosswords has remained similar throughout the years, the number of collaborations has increased. This suggests more openness to inducting new people and a greater sense of community.
nytData["C1C2 Gender"]=nytData["C1 Gender"].copy()+nytData["C2 Gender"].copy()
plt.figure(figsize=(16,5))
sns.countplot(data=nytData,x="Year",hue="C1C2 Gender")
plt.title("Number of puzzles published by different pairs of constructors over the years")

collab["C1C2 Gender"]=collab["C1 Gender"].copy()+collab["C2 Gender"].copy()
plt.figure(figsize=(16,5))
sns.countplot(data=collab,x="Year",hue="C1C2 Gender")
plt.title("Number of puzzles published by different pairs of constructors over the years")

No clear trends can be seen between the genders of collaborators.
sns.lineplot(data=nytData,x="Year",y="AShort")
sns.lineplot(data=collab,x="Year",y="AShort",color="g")
plt.ylabel("Short Score")
plt.title("Short score of crosswords over time, split by collaborations")

sns.lineplot(data=nytData,x="Year",y="ALong")
sns.lineplot(data=collab,x="Year",y="ALong",color="g")
plt.ylabel("Long Score")
plt.title("Long score of crosswords over time, split by collaborations")

sns.lineplot(data=nytData[(nytData["C1 No."]<10)],x="Year",y="AShort")
sns.lineplot(data=collab[(collab["C1 No."]<10) | (collab["C2 No."]<10)],x="Year",y="AShort",color="g")
plt.ylabel("Short Score")
plt.title("Short score of crosswords over time by newer constructors, split by collaborations")

sns.lineplot(data=nytData[(nytData["C1 No."]<10)],x="Year",y="ALong")
sns.lineplot(data=collab[(collab["C1 No."]<10) | (collab["C2 No."]<10)],x="Year",y="ALong",color="g")
plt.ylabel("Long Score")
plt.title("Long score of crosswords by newer constructors over time, split by collaborations")

Collaborations seem to have similar amounts of crosswordese while increasing the freshness of puzzles, which is an overall gain. This is seen even more with newer constructors, which benefits them, as it makes it easier for their puzzles to be accepted by the New York Times.
<div class="alert alert-block alert-warning">For each research question, summarize in 2-3 visualizations which will answer the question. Interpret the results accordingly and give your observation and conclusion. The visualizations should be well presented (apply what you have learnt in Chapter 9 on data communication). The plots shown here could be an enhanced version of the EDA plots, or presented in another format.</div>

For this section, Day 0 refers to Monday, going through the week until Day 6, Sunday.
For this section, I define the short score as how "good" entries of 7 letters or fewer are, measured either by the scores given by wordlist makers or by their frequency on Google NGram. The higher the score, the better. Recall that crosswordese refers to obscure fill, so the higher the score, the less crosswordese there is.
nrow=1
ncol=2
fig = plt.figure(figsize=(16,5))
gs = fig.add_gridspec(nrow, ncol, hspace=0.2, wspace=0.5)
axes = gs.subplots(sharex=True, sharey=False)
sns.stripplot(ax=axes[0],x="Day",y="AAScore",data=short,alpha=0.01)
sns.boxplot(ax=axes[0],x="Day",y="AAScore",data=short)
axes[0].set_title("Short score using wordlist A against the day of week")
axes[0].set_ylim((40,80))
axes[0].set_ylabel("Wordlist Short Score")
sns.stripplot(ax=axes[1],x="Day",y="NNScore",data=short)
axes[1].set_title("Short score using NGram against the day of week")
axes[1].set_ylabel("NGram Short Score")
plt.suptitle("Short Scores of crosswords, scored using wordlists and NGram, split by day")
pass

From the wordlist scoring, we can see that the average score of crosswords decreases through the week, albeit minimally. This may signal that the puzzles become less accessible as the week goes on, which would be consistent with editors wishing to cater to crossword buffs. The wordlist boxplot suggests that they try to cater to both newbies and seasoned puzzlers by making early-week puzzles very accessible and later-week ones harder. This is a reasonable compromise.

From the NGram score, we can clearly observe three distinct bands in the data. A possible explanation is that the NGram dataset is sparser, but I suspect that is not the case. It may be that crosswords are simply inaccessible to people outside the crossword community, which would explain why scores are higher when the person grading the entries is a crossword maker themself and understands the constraints. Hence, the wordlist scores are closely clustered, as the wordlist makers score rather similarly, whilst the NGram scores are more spread out, since they represent real-world usage, which is very different from crossword usage.
```python
nrow = 1
ncol = 2
fig = plt.figure(figsize=(16, 5))
gs = fig.add_gridspec(nrow, ncol, hspace=0.2, wspace=0.2)
axes = gs.subplots(sharex=True, sharey=False)

# Left panel: short score measured with the crossword wordlist
axes[0].set_ylim((58, 70))
sns.lineplot(ax=axes[0], x="Year", y="AAScore", data=short, hue="Outlet")
axes[0].set_ylabel("Short score using wordlist")
axes[0].set_title("Short scores of crosswords using wordlist over the years")

# Right panel: short score measured with NGram, drawn without a confidence band
sns.lineplot(ax=axes[1], x="Year", y="NNScore", data=short, hue="Outlet", ci=False)
axes[1].set_ylabel("Short score using NGram")
axes[1].set_title("Short scores of crosswords using NGram over the years")
plt.suptitle("Short scores of crosswords over the years, split by outlet")
```

By wordlist standards, the short score of every outlet is generally increasing. This is some evidence that the amount of crosswordese has decreased over the years, which means greater accessibility for the average solver, who knows at least some crosswordese.

The NGram graph is more erratic, with only USA Today seeing a significant increase. This suggests that a total novice, a person who has never seen a puzzle before, could have a much harder time. A moderate amount of knowledge is needed to break into solving crosswords, but that skill floor has decreased over the years.

The interesting outlier here is USA Today, and it is an outlier by design. The USA Today crossword touts itself as one of the easier, beginner-friendly puzzles, and the data bear this out, with a large improvement in accessibility in recent years. This big jump does not imply that it is superior to the other outlets.

Instead, it showcases a compromise. Traditionally, crossword grids are rotationally or reflectionally symmetrical. In recent years, however, USA Today has given up symmetry in favour of better fill and less crosswordese. The compromise appears to work: the grids are less elegant, but they can induct more solvers into the crossword universe, which is an overall gain.
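The claim that scores are "increasing" can be made more concrete by fitting a straight line to each outlet's yearly mean score. The following is only a minimal sketch: the helper `yearly_trend` and the toy data are hypothetical, and it assumes a table shaped like `short`, with `Outlet`, `Year` and a score column such as `AAScore`, where every outlet has at least two years of data.

```python
import numpy as np
import pandas as pd


def yearly_trend(df: pd.DataFrame, score_col: str) -> pd.Series:
    """Least-squares slope of the yearly mean score per outlet (points per year)."""
    yearly = df.groupby(["Outlet", "Year"])[score_col].mean().reset_index()
    slopes = {}
    for outlet, grp in yearly.groupby("Outlet"):
        # Shift years to start at 0 for numerical stability; the slope is unchanged
        x = grp["Year"] - grp["Year"].min()
        slope, _intercept = np.polyfit(x, grp[score_col], deg=1)
        slopes[outlet] = slope
    return pd.Series(slopes, name=f"{score_col}_slope")


# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "Outlet": ["A"] * 4 + ["B"] * 4,
    "Year": [2018, 2019, 2020, 2021] * 2,
    "AAScore": [60.0, 61.0, 62.0, 63.0, 64.0, 64.0, 64.0, 64.0],
})
slopes = yearly_trend(toy, "AAScore")
print(slopes)
```

A positive slope indicates a rising score. A single slope hides any non-linear behaviour, so it complements the line plot rather than replacing it.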
For this section, I define the long score as a metric for freshness: the higher the long score, the fresher the puzzle. Higher freshness is better.
```python
nrow = 1
ncol = 2
fig = plt.figure(figsize=(16, 5))
gs = fig.add_gridspec(nrow, ncol, hspace=0.2, wspace=0.2)
axes = gs.subplots(sharex=True, sharey=False)

# Left panel: long score under my own metric
sns.stripplot(y="Score", x="Day", data=long, jitter=0.2, alpha=0.2, ax=axes[0])
axes[0].set_ylabel("Own long Score")
axes[0].set_title("Own long Score against day")

# Right panel: long score under the wordlist metric
sns.stripplot(y="AScore", x="Day", data=long, jitter=0.2, alpha=0.2, ax=axes[1])
axes[1].set_ylabel("Wordlist long Score")
axes[1].set_title("Wordlist long Score against day")

# Title the two-panel figure itself (no extra empty figure is created)
fig.suptitle("Long score of crosswords, split by day of week and metric used")
pass  # suppress the notebook's text output
```

The long score rises as the week goes on. Sunday puzzles naturally have a higher long score because their grids are larger. For the other days of the week, a different explanation is required. A possible reason is the increasing difficulty of crosswords through the week, especially on Friday and Saturday, which are themeless puzzles at some outlets. Freed from a theme, constructors have more freedom in gridding, so there are more long answers and each of them must shine. The trend appears under both my own metric and the wordlist metric, which supports the result. In other words, although late-week puzzles are less accessible to new solvers, veterans can expect many snazzy answers.
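The day-of-week trend can also be summarised numerically by taking the median long score per day. This is a minimal sketch, assuming that the `Day` column holds full English day names; the helper `median_by_day`, the `DAYS` list and the toy data are hypothetical.

```python
import pandas as pd

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def median_by_day(df: pd.DataFrame, score_col: str) -> pd.Series:
    """Median score per day of week, in calendar order (NaN for days with no data)."""
    day = pd.Series(pd.Categorical(df["Day"], categories=DAYS, ordered=True), index=df.index)
    return df.groupby(day, observed=False)[score_col].median()


# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "Day": ["Monday", "Monday", "Saturday", "Sunday"],
    "Score": [100.0, 200.0, 400.0, 900.0],
})
medians = median_by_day(toy, "Score")
print(medians)
```

The median is used rather than the mean so that a few unusually fresh puzzles do not dominate a day's summary.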
```python
plt.figure(figsize=(12, 5))
sns.lineplot(x="Year", y="AScore", data=long, hue="Outlet")
plt.ylabel("Long Score")
plt.title("Lineplot of long score of various outlets' crosswords over time")
```

Based on this line chart alone, the New York Times is the best place for fresh fill, which may explain why it is called "the gold standard". For background, it has the highest rates in the industry, paying about 500 USD per puzzle. This draws many constructors, so the NYT gets more submissions and has the luxury of picking only the best, which gives it a competitive edge over the other outlets. While the freshness of most other outlets has barely changed, that of the New York Times has risen above the rest.

The increasing trend shows that constructors are constantly raising the bar on how fresh puzzles can be. With the rise of computer construction software, it has never been easier to construct crosswords. Trial and error is no longer required, and constructors can focus solely on making their crossword the best it can be. In my opinion, this graph reflects such a shift.

The Wall Street Journal's decline may be explained by the fact that it now has fewer Sunday crosswords, which lowers its long score. Nowadays, however, it is in line with most other outlets.
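One way to check the Sunday explanation is to compare an outlet's yearly mean long score with and without its Sunday puzzles. The sketch below is hypothetical: it assumes a table shaped like `long`, with `Outlet`, `Year`, `Day` and `AScore` columns, and that Sundays are labelled `"Sunday"`. The helper name and the toy data are made up for illustration.

```python
import pandas as pd


def yearly_mean_with_and_without_sunday(
    df: pd.DataFrame, outlet: str, score_col: str = "AScore"
) -> pd.DataFrame:
    """Yearly mean score for one outlet: all puzzles vs. Monday-Saturday only."""
    sub = df[df["Outlet"] == outlet]
    all_days = sub.groupby("Year")[score_col].mean().rename("all_days")
    no_sunday = (
        sub[sub["Day"] != "Sunday"].groupby("Year")[score_col].mean().rename("no_sunday")
    )
    return pd.concat([all_days, no_sunday], axis=1)


# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "Outlet": ["W", "W", "W", "W"],
    "Year": [2010, 2010, 2020, 2020],
    "Day": ["Monday", "Sunday", "Monday", "Tuesday"],
    "AScore": [300.0, 700.0, 300.0, 320.0],
})
comparison = yearly_mean_with_and_without_sunday(toy, "W")
print(comparison)
```

If the decline disappears in the `no_sunday` column, the change in the mix of puzzle sizes, rather than a change in freshness, would be the more plausible cause.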
```python
import matplotlib.patches as mpatches

nrow = 2
ncol = 3
fig = plt.figure(figsize=(16, 4))
gs = fig.add_gridspec(nrow, ncol, hspace=0.2, wspace=0)
axes = gs.subplots(sharex=True, sharey=True)

outlets = ["New York Times", "L.A. Times Daily", "USA Today", "Universal", "Wall Street Journal"]
count = 0  # index of the panel being drawn
for r in range(nrow):
    for c in range(ncol):
        if count == 5:
            # Last panel: all outlets combined
            sns.lineplot(ax=axes[1, 2], data=names, x="Year", y="BScore", color="b")
            sns.lineplot(ax=axes[1, 2], data=names, x="Year", y="GScore", color="pink")
            axes[1, 2].set_title("All Crosswords")
        else:
            # One panel per outlet: male names in blue, female names in pink
            subset = names[names.Outlet == outlets[count]]
            sns.lineplot(ax=axes[r, c], data=subset, x="Year", y="BScore", color="b")
            sns.lineplot(ax=axes[r, c], data=subset, x="Year", y="GScore", color="pink")
            axes[r, c].set_title(outlets[count])
        count += 1
        axes[r, c].set_ylabel("occurrences")
axes[0, 2].legend(handles=[mpatches.Patch(color='b'), mpatches.Patch(color='pink')], labels=["Male", "Female"])
plt.suptitle("occurrences of Gendered names in crosswords over the years")
```

This graph shows the five major outlets and how their use of gendered names has changed over the years. The New York Times, L.A. Times Daily and Wall Street Journal still use more male names than female names in their crosswords, whilst Universal and USA Today seem to be closing the gap. This change is probably deliberate: under editors who aim to be more inclusive, these crosswords try to reflect society more fully and to attract more women to the puzzle.

Although it is less obvious, the New York Times also seems to be becoming more inclusive, with a dip in the number of male names used. This may be due to other factors, which are discussed in question 4.

Across all outlets, the counts of gendered names seem to be converging, with more female names and fewer male ones. This should be celebrated, as it reflects a change in the perception of the crossword. Since a crossword somewhat reflects its constructor, the narrowing gap suggests that the crossword is becoming more diverse. The crossword is no longer only for men; more people can now see themselves in it. Seeing something one identifies with, rather than yet another baseball team, does affect how a solver feels.
```python
plt.figure(figsize=(12, 5))
sns.boxplot(data=names, x="Outlet", y="Total")
plt.ylabel("Number of names")
plt.title("Number of names in crosswords per outlet")
pass  # suppress the notebook's text output
```

Most outlets use about the same number of names; the clear outlier is the Wall Street Journal. More names make a crossword harder for a solver who does not recognise them, which goes back to the issue of crosswordese. Since the median number of names is around 8-9 per puzzle, and higher for the Wall Street Journal, it is important that crosswords contain a diverse set of names, so that no one group of people feels left out. Names are not just another clue; they have the power to make us connect and feel things.

On a side note, the Wall Street Journal data may be skewed because it contains more Sunday crosswords, which are bigger and hence may contain more names. If so, the Wall Street Journal probably does not deviate much from the general trend of the other outlets.
```python
# sns.countplot(data=nytData, x="Year", hue="C1 Gender")
# Masks: puzzles with no female constructor, and puzzles with at least one
no_female = (nytData["C1 Gender"] != "F") & (nytData["C2 Gender"] != "F")
any_female = (nytData["C1 Gender"] == "F") | (nytData["C2 Gender"] == "F")
pivot = pd.concat(
    [
        nytData.groupby("Year")["C1 Gender"].count(),
        nytData[no_female].groupby("Year")["C1 Gender"].count(),
        nytData[any_female].groupby("Year")["C1 Gender"].count(),
    ],
    axis=1,
)
pivot.columns = ["Total", "Male", "Female"]

years = [x for x in range(1997, 2023) if x != 2000]  # missing data in 2000
plt.figure(figsize=(17, 5))
plt.bar(years, pivot["Male"] / pivot["Total"], label="Male")  # plot the bottom-most bar first
plt.bar(years, pivot["Female"] / pivot["Total"], label="Female", bottom=pivot["Male"] / pivot["Total"])
plt.title("The NYT Crossword is still very male dominated")
plt.ylabel("Relative frequency")
plt.xlabel("Year")
plt.legend()
pass  # suppress the notebook's text output
```

```python
# Same breakdown as above, grouped by day of week instead of year
no_female = (nytData["C1 Gender"] != "F") & (nytData["C2 Gender"] != "F")
any_female = (nytData["C1 Gender"] == "F") | (nytData["C2 Gender"] == "F")
pivot = pd.concat(
    [
        nytData.groupby("Day")["C1 Gender"].count(),
        nytData[no_female].groupby("Day")["C1 Gender"].count(),
        nytData[any_female].groupby("Day")["C1 Gender"].count(),
    ],
    axis=1,
)
pivot.columns = ["Total", "Male", "Female"]
pivot.sort_values("Male", ascending=True, inplace=True)  # order bars by male count

plt.figure(figsize=(17, 5))
plt.bar(pivot.index, pivot["Male"] / pivot["Total"], label="Male")
plt.bar(pivot.index, pivot["Female"] / pivot["Total"], label="Female", bottom=pivot["Male"] / pivot["Total"])
plt.title("The NYT Crossword is still very male dominated")
plt.ylabel("Relative frequency")
plt.xlabel("Day of week")
plt.legend()
pass  # suppress the notebook's text output
```

The first graph shows that the NYT is still very much male-dominated in terms of constructors. Although the share of female constructors since 2020 is higher than in earlier years, most crosswords are still made by male constructors.

The problem is exacerbated when the trend is split by the day of the week. First, notice that the bars are sorted by the number of male-constructed puzzles, yet they fall almost exactly in weekday order, with Sunday the only exception. Since the difficulty of the puzzle increases through the week, this may be why fewer women construct for the later days. As discussed previously, the later days of the week are for crossword buffs, a rather gated community, so women may be less inclined to construct for those days. Earlier days in the week are more accessible, which may be why they choose those instead.

This is not to discount them as constructors, however. External factors, even simple personal preference, may well explain why they construct those puzzles less often. It is still a problem, though, as it creates a potential male bias in the fill of late-week puzzles, making an already hard puzzle even more inaccessible for a certain gender.
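The share of puzzles with at least one female constructor can be computed more compactly as the mean of a boolean mask. This is a sketch rather than the code used for the graphs: the helper `female_share` and the toy data are hypothetical, and it assumes the `C1 Gender` and `C2 Gender` columns use `"F"` for female constructors, as in the code above.

```python
import pandas as pd


def female_share(df: pd.DataFrame, by: str) -> pd.Series:
    """Fraction of puzzles with at least one female constructor, per group."""
    has_female = (df["C1 Gender"] == "F") | (df["C2 Gender"] == "F")
    # The mean of a boolean mask is the fraction of True values in each group
    return has_female.groupby(df[by]).mean().rename("female_share")


# Hypothetical toy data, for illustration only (None = no second constructor)
toy = pd.DataFrame({
    "Year": [1999, 1999, 2001, 2001, 2001, 2001],
    "C1 Gender": ["M", "F", "M", "F", "F", "M"],
    "C2 Gender": [None, None, "F", None, None, None],
})
share = female_share(toy, "Year")
print(share)
```

Because the result is indexed by the groups actually present in the data, plotting against `share.index` would avoid hard-coding the list of years and the gap at 2000. As in the original code, the complement `1 - share` counts every puzzle without a known female constructor as male, including any with missing gender data.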
```python
nrow = 1
ncol = 2
fig = plt.figure(figsize=(16, 5))
gs = fig.add_gridspec(nrow, ncol, hspace=0.2, wspace=0.2)
axes = gs.subplots(sharex=True, sharey=False)

# Left panel: male names per puzzle, split by the first constructor's gender
sns.lineplot(data=nytData, y="BNames", x="Year", hue="C1 Gender", ax=axes[0])
axes[0].set_ylabel("No. of boy names")
axes[0].set_title("Number of male names in NYT crosswords")
axes[0].set_ylim((3, 8))

# Right panel: female names per puzzle, on the same y-range for comparison
sns.lineplot(data=nytData, y="GNames", x="Year", hue="C1 Gender", ax=axes[1])
axes[1].set_ylabel("No. of girl names")
axes[1].set_title("Number of female names in NYT crosswords")
axes[1].set_ylim((3, 8))
```

This graph illustrates the importance of diversity in construction. First, male names are used more than female names. Since USA Today has shown that more female names can be used, this is not ideal. The graph still reveals an important aspect of construction, however: female constructors tend to use more female names than male constructors do. Having grown up as women, their life experiences, idols and tastes are likely to be different. They bring a piece of their personality into the puzzle, incorporating what men may not consider common knowledge. While male constructors have been decreasing the number of names they use, female constructors have been capitalising on names to bring more of their character into the puzzle.
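The comparison between constructor genders can be reduced to a single number per group: the average fraction of a puzzle's gendered names that are female. The sketch below is hypothetical and assumes the `BNames`, `GNames` and `C1 Gender` columns used in the plot above; the helper `female_name_share` and the toy data are made up for illustration.

```python
import pandas as pd


def female_name_share(df: pd.DataFrame) -> pd.Series:
    """Mean per-puzzle fraction of gendered names that are female, by constructor gender."""
    total = df["BNames"] + df["GNames"]
    # Puzzles with no gendered names become NaN and are skipped by mean()
    share = df["GNames"] / total.where(total > 0)
    return share.groupby(df["C1 Gender"]).mean().rename("female_name_share")


# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "C1 Gender": ["M", "M", "F", "F"],
    "BNames": [6, 4, 3, 0],
    "GNames": [2, 4, 3, 0],
})
by_gender = female_name_share(toy)
print(by_gender)
```

A share is used instead of a raw count so that puzzles with many names overall, such as larger Sunday grids, do not dominate the comparison.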
```python
# Count puzzles by the constructor's puzzle number (rows) and day of week (columns)
pivot = pd.pivot_table(data=nytData, index="C1 No.", columns="Day", values="Date", aggfunc="count")
pivot.fillna(0, inplace=True)
pivot = pivot.iloc[:10]  # keep only the first ten rows, i.e. newer constructors
plt.figure(figsize=(9, 7))
sns.heatmap(pivot[daysOfWeek], cmap="YlGnBu", linewidths=.5)
plt.ylabel("Number of puzzles constructed by the constructor")
plt.xlabel("Day of week")
plt.title("Heatmap of which puzzles newer constructors construct")
pass  # suppress the notebook's text output
```

As the week goes on, fewer and fewer newer constructors make the puzzle. This may be caused by the difficulty of constructing such puzzles, which is known to increase through the week. The anomaly of Sunday can be explained by its difficulty being closer to that of a Wednesday or Thursday puzzle. As solving difficulty increases, so does the difficulty of construction. Newer constructors have a higher chance of being rejected, so making an early-week puzzle is safer. Jeff Chen of XWordInfo, a crossword site, recommends that newcomers not dive straight into constructing late-week puzzles. As discussed before, the low density of newer constructors may also be due to the NYT being the gold standard: it is very difficult to meet its high standards straight away.
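Raw counts in such a heatmap tend to be dominated by the first rows, because every constructor has a first puzzle but fewer have a tenth. Normalising each row to sum to 1 makes the day-of-week pattern comparable across rows. This is a minimal sketch, assuming `C1 No.` is the ordinal number of the constructor's puzzle and `Day` holds full English day names; the helper `day_share_by_experience` and the toy data are hypothetical.

```python
import pandas as pd

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def day_share_by_experience(df: pd.DataFrame, max_puzzles: int = 10) -> pd.DataFrame:
    """Row-normalised day-of-week distribution for each of the first `max_puzzles` puzzles."""
    early = df[df["C1 No."] <= max_puzzles]
    table = pd.crosstab(early["C1 No."], early["Day"], normalize="index")
    # Put the days in calendar order and fill days that never occur with 0
    return table.reindex(columns=DAYS, fill_value=0.0)


# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "C1 No.": [1, 1, 1, 1, 2, 2],
    "Day": ["Monday", "Monday", "Tuesday", "Friday", "Monday", "Saturday"],
})
table = day_share_by_experience(toy)
print(table)
```

The resulting table can be passed to `sns.heatmap` in place of the count pivot.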
```python
nrow = 1
ncol = 2
fig = plt.figure(figsize=(16, 5))
gs = fig.add_gridspec(nrow, ncol, hspace=0.2, wspace=0.2)
axes = gs.subplots(sharex=True, sharey=False)

# Newer constructors: at most 10 puzzles; seasoned constructors: more than 10
newer = nytData[nytData["C1 No."] <= 10]
seasoned = nytData[nytData["C1 No."] > 10]

sns.lineplot(ax=axes[1], data=newer, x="Year", y="ALong", color="b")
sns.lineplot(ax=axes[1], data=seasoned, x="Year", y="ALong", color="g")
axes[1].set_ylabel("Long Score")
axes[1].set_title("Newer constructors have lower long scores than seasoned ones")

sns.lineplot(ax=axes[0], data=newer, x="Year", y="AShort", color="b")
sns.lineplot(ax=axes[0], data=seasoned, x="Year", y="AShort", color="g")
axes[0].set_ylabel("Short Score")
axes[0].set_title("Newer constructors have quite similar short scores to seasoned ones")
axes[1].legend(handles=[mpatches.Patch(color='b'), mpatches.Patch(color='g')], labels=["Newer", "Seasoned"])
pass  # suppress the notebook's text output
```

Newer constructors are at a slight disadvantage when it comes to freshness, which is to be expected given their inexperience. This helps explain why it is hard for newer constructors to get accepted: they face stiff competition. That is not to say that the NYT does not accept them; it still tries to aid newer constructors.
```python
plt.figure(figsize=(12, 5))
# Bin edges run to 2023 so that 2022 gets its own bar instead of sharing the last bin with 2021
ax = sns.histplot(data=collab, x="Year", color="b", bins=range(1997, 2024))
plt.title("Number of collaborations has been increasing year on year")
plt.ylabel("Number of collaborations")
# Label each bar with its count, leaving empty bars unlabelled
labels = [str(int(v)) if v else '' for v in ax.containers[0].datavalues]
ax.bar_label(ax.containers[0], labels=labels)
pass  # suppress the notebook's text output
```

Over the years, the number of collaborations has increased. This may be due to better communication and a community that interacts more. More collaborations can benefit everyone. Constructors pool a wider range of experiences, which can produce a more diverse puzzle, one for one and all. Solvers get a more robust puzzle, which elevates their solving experience.
```python
nrow = 2
ncol = 2
fig = plt.figure(figsize=(16, 5))
gs = fig.add_gridspec(nrow, ncol, hspace=0.2, wspace=0.2)
axes = gs.subplots(sharex=True, sharey=False)

# Newer constructors: fewer than 10 puzzles (either constructor, for collaborations)
newer_all = nytData[nytData["C1 No."] < 10]
newer_collab = collab[(collab["C1 No."] < 10) | (collab["C2 No."] < 10)]

# Generate the four panels of the small multiple.
# Left column: all constructors; right column: newer constructors.
# Default colour: all puzzles; green: collaborations.
sns.lineplot(data=nytData, x="Year", y="AShort", ax=axes[0, 0])
sns.lineplot(data=collab, x="Year", y="AShort", color="g", ax=axes[0, 0])
axes[0, 0].set_ylabel("Short Score")
axes[0, 0].set_title("Short score of crosswords over time, split by collaborations")

sns.lineplot(data=nytData, x="Year", y="ALong", ax=axes[1, 0])
sns.lineplot(data=collab, x="Year", y="ALong", color="g", ax=axes[1, 0])
axes[1, 0].set_ylabel("Long Score")
axes[1, 0].set_title("Long score of crosswords over time, split by collaborations")

sns.lineplot(data=newer_all, x="Year", y="AShort", ax=axes[0, 1])
sns.lineplot(data=newer_collab, x="Year", y="AShort", color="g", ax=axes[0, 1])
axes[0, 1].set_ylabel("Short Score")
axes[0, 1].set_title("Short score of crosswords by newer constructors over time, split by collaborations")

sns.lineplot(data=newer_all, x="Year", y="ALong", ax=axes[1, 1])
sns.lineplot(data=newer_collab, x="Year", y="ALong", color="g", ax=axes[1, 1])
axes[1, 1].set_ylabel("Long Score")
axes[1, 1].set_title("Long score of crosswords by newer constructors over time, split by collaborations")

# Use the same y-range within each row so the columns are comparable
axes[0, 0].set_ylim((55, 65))
axes[0, 1].set_ylim((55, 65))
axes[1, 0].set_ylim((200, 700))
axes[1, 1].set_ylim((200, 700))
plt.suptitle("How collaborations affect crossword quality")
pass  # suppress the notebook's text output
```

Overall, for both seasoned and newer constructors, collaborations do not change the amount of crosswordese. This may simply be because crosswordese is a necessary evil of "good" crossword puzzles and hence unavoidable. What the constructor can control is the long score, the freshness factor. For both types of constructors, freshness increases slightly when they collaborate. Although the increase is small, it matters because of how these answers are scored: on a wordlist scale, an increase of just 10 signifies a "better", trendier answer. Hence, any improvement helps.
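Whether a small difference in long score is more than noise can be checked with a permutation test on the difference in means. This is a sketch of one possible check, not part of the original analysis: the function `permutation_test` and the sample values are hypothetical, and the two inputs would be the `ALong` values of collaborative and solo puzzles respectively.

```python
import numpy as np


def permutation_test(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation test for a difference in means between samples a and b."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_perm):
        # Randomly reassign the pooled values to two groups of the original sizes
        rng.shuffle(pooled)
        diff = pooled[: a.size].mean() - pooled[a.size:].mean()
        if abs(diff) >= abs(observed):
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical samples, for illustration only
clearly_different = permutation_test([100, 101, 102, 103, 104, 105], [0, 1, 2, 3, 4, 5])
identical = permutation_test([1, 2, 3, 4], [1, 2, 3, 4])
print(clearly_different, identical)
```

A small p-value would suggest that the gap is unlikely to arise from chance alone. It would not show that collaboration causes the improvement, since collaborators may differ from solo constructors in other ways.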
# Recommendations or Further Works<a id='recommendations'/>

<div class="alert alert-block alert-warning">State any recommendations, improvements or further works.</div>

Recommendations:

Little can be done about crosswordese; it seems to be set in stone. Freshness can and should be improved, with the help of computer programs. The current trend shows that more inclusivity is needed. Various programs provide valuable mentorship to constructors and help them debut in a crossword. The most help is needed for late-week crosswords. More needs to be done so that crosswords give a better picture of the people in them. Some outlets are already seeing a shift. For example, the L.A. Times Daily has been publishing more puzzles by female constructors, trying to keep their share above 50%. Such actions can hopefully induct more marginalised groups into crosswords, making for a better community. Of course, some outlets, such as the New York Times, simply cannot do that. However, they should do their best to help marginalised constructors whose crosswords are not quite up to par, to give them a chance. Even though this idea is somewhat unfair, it also serves as a form of affirmative action.

Limitations:

This project relies largely on data that I transformed myself from the original sources. Although I have made my best effort to justify each transformation, the transformed data are not perfect and may be somewhat inaccurate. That said, some of the findings match what has already been found elsewhere, so they can generally be considered reliable. One concrete example of this limitation is that long-word scoring on my own scale was rather unreliable. Additionally, the only dataset I could find for Q4 was from the New York Times, so the findings may be biased and may not apply to other outlets.

Future works

Q4 can be explored further for the other outlets, for which I am missing data. A more concrete and objective metric could be used, as my methods are imperfect. As more crossword types and formats appear, a similar project could be done on other variations of the crossword. A famous and interesting example would be the cryptic crossword, which is mostly played in the UK.
# Appendix

An appendix is included along with this project, which contains all the web scraping code.
<div class="alert alert-block alert-warning">Cite any references made, and links where you obtained the data.

1. https://wordpress.com/support/markdown-quick-reference/ (you may refer to this link on markup for Jupyter when formatting your proposal)
2. https://pudding.cool/2020/11/crossword/ (article that has tried looking at representation before)
3. https://noahveltman.com/crossword/ (project that has tried looking at crosswordese before)
4. https://www.nytimes.com/interactive/2016/02/07/opinion/what-74-years-of-times-crosswords-say-about-the-words-we-use.html (article by the New York Times looking at the evolution of words)

Dataset references

1. https://www.crosswordgiant.com/browse (website to scrape for the clue answer pairs)
2. https://www.xwordinfo.com/ (contains more data about the NYT Crossword in particular)
3. https://books.google.com/ngrams/ (for word searching)
4. https://peterbroda.me/crosswords/wordlist/lists/peter-broda-wordlist__scored.txt (crossword construction wordlist by Peter Broda)
5. https://drive.google.com/uc?export=download&id=1Ruxn8XzRNstU6sDPOMm_K72fVookrPPr (crossword construction wordlist by Brooke Husic and Enrique Henestroza Anguiano)
6. https://www.verywellfamily.com/top-1000-baby-boy-names-2757618 (boy name list)
7. https://www.verywellfamily.com/top-1000-baby-girl-names-2757832 (girl name list)

</div>